• DOMAIN: Automobile

• CONTEXT: The purpose is to classify a given silhouette as one of three types of vehicle, using a set of features extracted from the silhouette.

The vehicle may be viewed from one of many different angles.

• DATA DESCRIPTION: The data contains features extracted from silhouettes of vehicles photographed at different angles. Four "Corgi" model vehicles were used for the experiment: a double-decker bus, a Chevrolet van, a Saab 9000 and an Opel Manta 400. This particular combination was chosen with the expectation that the bus, the van and either one of the cars would be readily distinguishable, but that distinguishing between the two cars would be more difficult.

• PROJECT OBJECTIVE: Apply a dimensionality-reduction technique (PCA) and train a model on the principal components instead of on the raw features.

  1. Data: Import, clean and pre-process the data.

  2. EDA and visualisation: Produce a detailed exploratory report using univariate, bivariate and multivariate EDA techniques, and surface any hidden patterns in the data.
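The intended workflow (clean → scale → PCA → classifier) can be sketched end-to-end with scikit-learn's `Pipeline`. This is an illustrative sketch on synthetic data, not the vehicle dataset:

```python
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
from sklearn.svm import SVC
from sklearn.datasets import make_classification

# synthetic stand-in: 18 features, 3 classes, mimicking the dataset's shape
X, y = make_classification(n_samples=300, n_features=18, n_informative=8,
                           n_classes=3, random_state=0)

pipe = Pipeline([
    ("scale", StandardScaler()),    # z-score each feature
    ("pca", PCA(n_components=10)),  # keep 10 principal components
    ("svc", SVC(kernel="rbf")),     # classify on the components
])
pipe.fit(X, y)
print(round(pipe.score(X, y), 2))
```

Chaining the steps in one object keeps the scaler and PCA fitted on training data only, which avoids leakage when cross-validating.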

In [1]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

sc = StandardScaler()

import warnings
warnings.filterwarnings('ignore')
In [2]:
df=pd.read_csv("C:/Users/HP/Downloads/vehicle-1 (1).csv")
In [3]:
df
Out[3]:
compactness circularity distance_circularity radius_ratio pr.axis_aspect_ratio max.length_aspect_ratio scatter_ratio elongatedness pr.axis_rectangularity max.length_rectangularity scaled_variance scaled_variance.1 scaled_radius_of_gyration scaled_radius_of_gyration.1 skewness_about skewness_about.1 skewness_about.2 hollows_ratio class
0 95 48.0 83.0 178.0 72.0 10 162.0 42.0 20.0 159 176.0 379.0 184.0 70.0 6.0 16.0 187.0 197 van
1 91 41.0 84.0 141.0 57.0 9 149.0 45.0 19.0 143 170.0 330.0 158.0 72.0 9.0 14.0 189.0 199 van
2 104 50.0 106.0 209.0 66.0 10 207.0 32.0 23.0 158 223.0 635.0 220.0 73.0 14.0 9.0 188.0 196 car
3 93 41.0 82.0 159.0 63.0 9 144.0 46.0 19.0 143 160.0 309.0 127.0 63.0 6.0 10.0 199.0 207 van
4 85 44.0 70.0 205.0 103.0 52 149.0 45.0 19.0 144 241.0 325.0 188.0 127.0 9.0 11.0 180.0 183 bus
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
841 93 39.0 87.0 183.0 64.0 8 169.0 40.0 20.0 134 200.0 422.0 149.0 72.0 7.0 25.0 188.0 195 car
842 89 46.0 84.0 163.0 66.0 11 159.0 43.0 20.0 159 173.0 368.0 176.0 72.0 1.0 20.0 186.0 197 van
843 106 54.0 101.0 222.0 67.0 12 222.0 30.0 25.0 173 228.0 721.0 200.0 70.0 3.0 4.0 187.0 201 car
844 86 36.0 78.0 146.0 58.0 7 135.0 50.0 18.0 124 155.0 270.0 148.0 66.0 0.0 25.0 190.0 195 car
845 85 36.0 66.0 123.0 55.0 5 120.0 56.0 17.0 128 140.0 212.0 131.0 73.0 1.0 18.0 186.0 190 van

846 rows × 19 columns

In [4]:
df["class"].unique() 
Out[4]:
array(['van', 'car', 'bus'], dtype=object)
In [5]:
df.describe().transpose()
Out[5]:
count mean std min 25% 50% 75% max
compactness 846.0 93.678487 8.234474 73.0 87.00 93.0 100.0 119.0
circularity 841.0 44.828775 6.152172 33.0 40.00 44.0 49.0 59.0
distance_circularity 842.0 82.110451 15.778292 40.0 70.00 80.0 98.0 112.0
radius_ratio 840.0 168.888095 33.520198 104.0 141.00 167.0 195.0 333.0
pr.axis_aspect_ratio 844.0 61.678910 7.891463 47.0 57.00 61.0 65.0 138.0
max.length_aspect_ratio 846.0 8.567376 4.601217 2.0 7.00 8.0 10.0 55.0
scatter_ratio 845.0 168.901775 33.214848 112.0 147.00 157.0 198.0 265.0
elongatedness 845.0 40.933728 7.816186 26.0 33.00 43.0 46.0 61.0
pr.axis_rectangularity 843.0 20.582444 2.592933 17.0 19.00 20.0 23.0 29.0
max.length_rectangularity 846.0 147.998818 14.515652 118.0 137.00 146.0 159.0 188.0
scaled_variance 843.0 188.631079 31.411004 130.0 167.00 179.0 217.0 320.0
scaled_variance.1 844.0 439.494076 176.666903 184.0 318.00 363.5 587.0 1018.0
scaled_radius_of_gyration 844.0 174.709716 32.584808 109.0 149.00 173.5 198.0 268.0
scaled_radius_of_gyration.1 842.0 72.447743 7.486190 59.0 67.00 71.5 75.0 135.0
skewness_about 840.0 6.364286 4.920649 0.0 2.00 6.0 9.0 22.0
skewness_about.1 845.0 12.602367 8.936081 0.0 5.00 11.0 19.0 41.0
skewness_about.2 845.0 188.919527 6.155809 176.0 184.00 188.0 193.0 206.0
hollows_ratio 846.0 195.632388 7.438797 181.0 190.25 197.0 201.0 211.0
In [6]:
df.isnull().sum()
Out[6]:
compactness                    0
circularity                    5
distance_circularity           4
radius_ratio                   6
pr.axis_aspect_ratio           2
max.length_aspect_ratio        0
scatter_ratio                  1
elongatedness                  1
pr.axis_rectangularity         3
max.length_rectangularity      0
scaled_variance                3
scaled_variance.1              2
scaled_radius_of_gyration      2
scaled_radius_of_gyration.1    4
skewness_about                 6
skewness_about.1               1
skewness_about.2               1
hollows_ratio                  0
class                          0
dtype: int64
In [7]:
df = df.replace('?', np.nan)
In [8]:
df.shape
Out[8]:
(846, 19)
In [9]:
df[df.isnull().any(axis=1)]
Out[9]:
compactness circularity distance_circularity radius_ratio pr.axis_aspect_ratio max.length_aspect_ratio scatter_ratio elongatedness pr.axis_rectangularity max.length_rectangularity scaled_variance scaled_variance.1 scaled_radius_of_gyration scaled_radius_of_gyration.1 skewness_about skewness_about.1 skewness_about.2 hollows_ratio class
5 107 NaN 106.0 172.0 50.0 6 255.0 26.0 28.0 169 280.0 957.0 264.0 85.0 5.0 9.0 181.0 183 bus
9 93 44.0 98.0 NaN 62.0 11 183.0 36.0 22.0 146 202.0 505.0 152.0 64.0 4.0 14.0 195.0 204 car
19 101 56.0 100.0 215.0 NaN 10 208.0 32.0 24.0 169 227.0 651.0 223.0 74.0 6.0 5.0 186.0 193 car
35 100 46.0 NaN 172.0 67.0 9 157.0 43.0 20.0 150 170.0 363.0 184.0 67.0 17.0 7.0 192.0 200 van
66 81 43.0 68.0 125.0 57.0 8 149.0 46.0 19.0 146 169.0 323.0 172.0 NaN NaN 18.0 179.0 184 bus
70 96 55.0 98.0 161.0 54.0 10 215.0 31.0 NaN 175 226.0 683.0 221.0 76.0 3.0 6.0 185.0 193 car
77 86 40.0 62.0 140.0 62.0 7 150.0 45.0 19.0 133 165.0 330.0 173.0 NaN 2.0 3.0 180.0 185 car
78 104 52.0 94.0 NaN 66.0 5 208.0 31.0 24.0 161 227.0 666.0 218.0 76.0 11.0 4.0 193.0 191 bus
105 108 NaN 103.0 202.0 64.0 10 220.0 30.0 25.0 168 NaN 711.0 214.0 73.0 11.0 NaN 188.0 199 car
118 85 NaN NaN 128.0 56.0 8 150.0 46.0 19.0 144 168.0 324.0 173.0 82.0 9.0 14.0 180.0 184 bus
141 81 42.0 63.0 125.0 55.0 8 149.0 46.0 19.0 145 166.0 320.0 172.0 86.0 NaN 7.0 179.0 182 bus
159 91 45.0 75.0 NaN 57.0 6 150.0 44.0 19.0 146 170.0 335.0 180.0 66.0 16.0 2.0 193.0 198 car
177 89 44.0 72.0 160.0 66.0 7 144.0 46.0 19.0 147 166.0 312.0 169.0 69.0 NaN 1.0 191.0 198 bus
192 93 43.0 76.0 149.0 57.0 7 149.0 44.0 19.0 143 172.0 335.0 176.0 NaN 14.0 0.0 189.0 194 car
207 85 42.0 NaN 121.0 55.0 7 149.0 46.0 19.0 146 167.0 323.0 NaN 85.0 1.0 6.0 179.0 182 bus
215 90 39.0 86.0 169.0 62.0 7 162.0 NaN 20.0 131 194.0 388.0 147.0 74.0 1.0 22.0 185.0 191 car
222 100 50.0 81.0 197.0 NaN 6 186.0 34.0 22.0 158 206.0 531.0 198.0 74.0 NaN 1.0 197.0 198 bus
237 85 45.0 65.0 128.0 56.0 8 151.0 45.0 NaN 145 170.0 332.0 186.0 81.0 1.0 10.0 179.0 184 bus
249 85 34.0 53.0 127.0 58.0 6 NaN 58.0 17.0 121 137.0 197.0 127.0 70.0 NaN 20.0 185.0 189 car
266 86 NaN 65.0 116.0 53.0 6 152.0 45.0 19.0 141 175.0 335.0 NaN 85.0 5.0 4.0 179.0 183 bus
273 96 45.0 80.0 162.0 63.0 9 146.0 46.0 NaN 148 161.0 316.0 161.0 64.0 5.0 10.0 199.0 207 van
285 89 48.0 85.0 189.0 64.0 8 169.0 39.0 20.0 153 188.0 427.0 190.0 64.0 NaN 5.0 195.0 201 car
287 88 43.0 84.0 NaN 55.0 11 154.0 44.0 19.0 150 174.0 350.0 164.0 73.0 6.0 2.0 185.0 196 van
308 109 51.0 100.0 197.0 59.0 10 192.0 34.0 22.0 161 210.0 NaN 195.0 64.0 14.0 3.0 196.0 202 car
319 102 51.0 NaN 194.0 60.0 6 220.0 30.0 25.0 162 247.0 731.0 209.0 80.0 7.0 7.0 188.0 186 bus
329 89 38.0 80.0 169.0 59.0 7 161.0 41.0 20.0 131 186.0 389.0 137.0 NaN 5.0 15.0 192.0 197 car
345 101 54.0 106.0 NaN 57.0 7 236.0 28.0 26.0 164 256.0 833.0 253.0 81.0 6.0 14.0 185.0 185 bus
372 97 47.0 87.0 164.0 64.0 9 156.0 43.0 20.0 149 NaN 359.0 182.0 68.0 1.0 13.0 192.0 202 van
396 108 NaN 106.0 177.0 51.0 5 256.0 26.0 28.0 170 285.0 966.0 261.0 87.0 11.0 2.0 182.0 181 bus
419 93 34.0 72.0 144.0 56.0 6 133.0 50.0 18.0 123 158.0 263.0 125.0 63.0 5.0 20.0 NaN 206 car
467 96 54.0 104.0 NaN 58.0 10 215.0 31.0 24.0 175 221.0 682.0 222.0 75.0 13.0 23.0 186.0 194 car
496 106 55.0 98.0 224.0 68.0 11 215.0 31.0 24.0 170 222.0 NaN 214.0 68.0 2.0 29.0 189.0 201 car
522 89 36.0 69.0 162.0 63.0 6 140.0 48.0 18.0 131 NaN 291.0 126.0 66.0 1.0 38.0 193.0 204 car
In [10]:
df.median()
Out[10]:
compactness                     93.0
circularity                     44.0
distance_circularity            80.0
radius_ratio                   167.0
pr.axis_aspect_ratio            61.0
max.length_aspect_ratio          8.0
scatter_ratio                  157.0
elongatedness                   43.0
pr.axis_rectangularity          20.0
max.length_rectangularity      146.0
scaled_variance                179.0
scaled_variance.1              363.5
scaled_radius_of_gyration      173.5
scaled_radius_of_gyration.1     71.5
skewness_about                   6.0
skewness_about.1                11.0
skewness_about.2               188.0
hollows_ratio                  197.0
dtype: float64
In [11]:
df_var=df.drop('class',axis=1)
df_var=df_var.apply(lambda x: x.fillna(x.median()),axis=0)

Since there are two car sub-groups hidden within the "car" class, we will use clustering to separate them.

In [12]:
filt=(df['class']!='car')
df_others=df.loc[filt]
In [13]:
filt=(df['class']=='car')
df_car=df.loc[filt]
In [14]:
df_car_var=df_car.drop('class',axis=1)
df_car_var=df_car_var.apply(lambda x: x.fillna(x.median()),axis=0)
In [15]:
from scipy.stats import zscore
df_car_var_z = df_car_var.apply(zscore)
X=df_car_var_z
In [16]:
X
Out[16]:
compactness circularity distance_circularity radius_ratio pr.axis_aspect_ratio max.length_aspect_ratio scatter_ratio elongatedness pr.axis_rectangularity max.length_rectangularity scaled_variance scaled_variance.1 scaled_radius_of_gyration scaled_radius_of_gyration.1 skewness_about skewness_about.1 skewness_about.2 hollows_ratio
2 0.896094 0.564590 1.051486 0.916016 1.083514 0.569150 0.811345 -0.802204 0.595992 0.467751 0.887284 0.803126 1.182048 0.597495 1.248361 -0.611238 -0.279797 -0.253357
9 -0.365065 -0.289763 0.560172 0.173404 0.217914 1.053606 0.060594 -0.275182 0.195239 -0.231025 0.143975 0.026312 -0.808181 -1.151503 -0.566437 -0.115733 1.049683 1.027235
11 -0.709018 -1.713686 -1.405082 -1.440970 -1.296888 -1.368671 -1.816284 2.096419 -1.808528 -1.861502 -1.767389 -1.652803 -1.803295 -0.957170 -0.384957 1.073478 1.239609 0.707087
15 -0.021113 1.276552 0.867243 0.657716 0.867114 0.084695 0.717501 -0.802204 0.595992 0.933602 1.028866 0.737396 1.943017 0.791828 -0.203477 -1.304945 -0.659649 -0.573505
18 0.896094 1.134160 0.683000 0.173404 0.001513 0.569150 1.092877 -0.933960 0.996745 1.341221 0.958075 1.107876 1.182048 0.791828 -0.384957 -0.413036 -0.849575 -0.413431
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
837 -0.250415 -0.004979 -0.729526 -0.375483 -0.214887 -0.399760 -0.721439 0.515352 -0.606268 -0.114562 -0.599333 -0.762454 0.040593 -0.568504 0.885401 -1.304945 0.669832 0.226865
840 -0.365065 -1.713686 -1.405082 -1.311820 -1.080488 -0.884216 -1.597315 1.701152 -1.407774 -1.745040 -1.661202 -1.491464 -1.920367 -1.540169 -0.384957 1.370781 2.189238 1.507457
841 -0.365065 -1.001725 -0.115384 0.076542 0.650714 -0.399760 -0.377344 0.251841 -0.606268 -0.929801 0.073184 -0.469654 -0.895985 0.403162 -0.021998 0.974377 -0.279797 -0.413431
843 1.125395 1.134160 0.744415 1.335753 1.299915 1.538061 1.280565 -1.065716 1.397499 1.341221 1.064262 1.317019 0.596686 0.014496 -0.747917 -1.106743 -0.469723 0.547013
844 -1.167621 -1.428902 -0.668112 -1.118095 -0.647687 -0.884216 -1.440909 1.569397 -1.407774 -1.512114 -1.519620 -1.377930 -0.925253 -0.762837 -1.292356 0.974377 0.100054 -0.413431

429 rows × 18 columns

In [17]:
from sklearn.cluster import KMeans

# Find the optimal number of clusters with the elbow method
cluster_range = range(1, 10)   # scan k = 1 to 9
cluster_errors = []            # within-cluster sum of squares (inertia) for each k
for num_clusters in cluster_range:
    clusters = KMeans(num_clusters, n_init=5)
    clusters.fit(X)
    cluster_errors.append(clusters.inertia_)
clusters_df = pd.DataFrame( { "num_clusters":cluster_range, "cluster_errors": cluster_errors} )
clusters_df[0:15]
Out[17]:
num_clusters cluster_errors
0 1 7722.000000
1 2 3920.862881
2 3 2942.285000
3 4 2626.441019
4 5 2415.808234
5 6 2215.240609
6 7 2089.932268
7 8 1985.145162
8 9 1871.739483
In [18]:
plt.figure(figsize=(12,6))
plt.plot( clusters_df.num_clusters, clusters_df.cluster_errors, marker = "o" )
Out[18]:
[<matplotlib.lines.Line2D at 0x15d54062388>]

The elbow in the chart above indicates two dominant clusters, matching the two car models in the data.
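The elbow is a visual judgement; silhouette scores offer a complementary numeric check that k = 2 fits best. A sketch on synthetic two-blob data (not the car subset):

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

# two well-separated blobs stand in for the two car sub-groups
X_demo, _ = make_blobs(n_samples=300, centers=2, cluster_std=0.8, random_state=0)

# silhouette score is highest when k matches the true structure
for k in range(2, 5):
    labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(X_demo)
    print(k, round(silhouette_score(X_demo, labels), 3))
```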

In [19]:
cluster = KMeans( n_clusters = 2, random_state = 2354 )
cluster.fit(X)
prediction= cluster.predict(X)  
df_car["group"] = prediction 
In [20]:
df_car
Out[20]:
compactness circularity distance_circularity radius_ratio pr.axis_aspect_ratio max.length_aspect_ratio scatter_ratio elongatedness pr.axis_rectangularity max.length_rectangularity scaled_variance scaled_variance.1 scaled_radius_of_gyration scaled_radius_of_gyration.1 skewness_about skewness_about.1 skewness_about.2 hollows_ratio class group
2 104 50.0 106.0 209.0 66.0 10 207.0 32.0 23.0 158 223.0 635.0 220.0 73.0 14.0 9.0 188.0 196 car 0
9 93 44.0 98.0 NaN 62.0 11 183.0 36.0 22.0 146 202.0 505.0 152.0 64.0 4.0 14.0 195.0 204 car 0
11 90 34.0 66.0 136.0 55.0 6 123.0 54.0 17.0 118 148.0 224.0 118.0 65.0 5.0 26.0 196.0 202 car 1
15 96 55.0 103.0 201.0 65.0 9 204.0 32.0 23.0 166 227.0 624.0 246.0 74.0 6.0 2.0 186.0 194 car 0
18 104 54.0 100.0 186.0 61.0 10 216.0 31.0 24.0 173 225.0 686.0 220.0 74.0 5.0 11.0 185.0 195 car 0
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
837 94 46.0 77.0 169.0 60.0 8 158.0 42.0 20.0 148 181.0 373.0 181.0 67.0 12.0 2.0 193.0 199 car 1
840 93 34.0 66.0 140.0 56.0 7 130.0 51.0 18.0 120 151.0 251.0 114.0 62.0 5.0 29.0 201.0 207 car 1
841 93 39.0 87.0 183.0 64.0 8 169.0 40.0 20.0 134 200.0 422.0 149.0 72.0 7.0 25.0 188.0 195 car 1
843 106 54.0 101.0 222.0 67.0 12 222.0 30.0 25.0 173 228.0 721.0 200.0 70.0 3.0 4.0 187.0 201 car 0
844 86 36.0 78.0 146.0 58.0 7 135.0 50.0 18.0 124 155.0 270.0 148.0 66.0 0.0 25.0 190.0 195 car 1

429 rows × 20 columns

In [21]:
df=pd.concat([df_others,df_car],axis=0,ignore_index=True)
In [22]:
df
Out[22]:
compactness circularity distance_circularity radius_ratio pr.axis_aspect_ratio max.length_aspect_ratio scatter_ratio elongatedness pr.axis_rectangularity max.length_rectangularity scaled_variance scaled_variance.1 scaled_radius_of_gyration scaled_radius_of_gyration.1 skewness_about skewness_about.1 skewness_about.2 hollows_ratio class group
0 95 48.0 83.0 178.0 72.0 10 162.0 42.0 20.0 159 176.0 379.0 184.0 70.0 6.0 16.0 187.0 197 van NaN
1 91 41.0 84.0 141.0 57.0 9 149.0 45.0 19.0 143 170.0 330.0 158.0 72.0 9.0 14.0 189.0 199 van NaN
2 93 41.0 82.0 159.0 63.0 9 144.0 46.0 19.0 143 160.0 309.0 127.0 63.0 6.0 10.0 199.0 207 van NaN
3 85 44.0 70.0 205.0 103.0 52 149.0 45.0 19.0 144 241.0 325.0 188.0 127.0 9.0 11.0 180.0 183 bus NaN
4 107 NaN 106.0 172.0 50.0 6 255.0 26.0 28.0 169 280.0 957.0 264.0 85.0 5.0 9.0 181.0 183 bus NaN
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
841 94 46.0 77.0 169.0 60.0 8 158.0 42.0 20.0 148 181.0 373.0 181.0 67.0 12.0 2.0 193.0 199 car 1.0
842 93 34.0 66.0 140.0 56.0 7 130.0 51.0 18.0 120 151.0 251.0 114.0 62.0 5.0 29.0 201.0 207 car 1.0
843 93 39.0 87.0 183.0 64.0 8 169.0 40.0 20.0 134 200.0 422.0 149.0 72.0 7.0 25.0 188.0 195 car 1.0
844 106 54.0 101.0 222.0 67.0 12 222.0 30.0 25.0 173 228.0 721.0 200.0 70.0 3.0 4.0 187.0 201 car 0.0
845 86 36.0 78.0 146.0 58.0 7 135.0 50.0 18.0 124 155.0 270.0 148.0 66.0 0.0 25.0 190.0 195 car 1.0

846 rows × 20 columns

In [23]:
df['group'] = df['group'].replace(np.nan, 0)  # buses and vans were not clustered; 0 is a placeholder
In [24]:
df
Out[24]:
compactness circularity distance_circularity radius_ratio pr.axis_aspect_ratio max.length_aspect_ratio scatter_ratio elongatedness pr.axis_rectangularity max.length_rectangularity scaled_variance scaled_variance.1 scaled_radius_of_gyration scaled_radius_of_gyration.1 skewness_about skewness_about.1 skewness_about.2 hollows_ratio class group
0 95 48.0 83.0 178.0 72.0 10 162.0 42.0 20.0 159 176.0 379.0 184.0 70.0 6.0 16.0 187.0 197 van 0.0
1 91 41.0 84.0 141.0 57.0 9 149.0 45.0 19.0 143 170.0 330.0 158.0 72.0 9.0 14.0 189.0 199 van 0.0
2 93 41.0 82.0 159.0 63.0 9 144.0 46.0 19.0 143 160.0 309.0 127.0 63.0 6.0 10.0 199.0 207 van 0.0
3 85 44.0 70.0 205.0 103.0 52 149.0 45.0 19.0 144 241.0 325.0 188.0 127.0 9.0 11.0 180.0 183 bus 0.0
4 107 NaN 106.0 172.0 50.0 6 255.0 26.0 28.0 169 280.0 957.0 264.0 85.0 5.0 9.0 181.0 183 bus 0.0
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
841 94 46.0 77.0 169.0 60.0 8 158.0 42.0 20.0 148 181.0 373.0 181.0 67.0 12.0 2.0 193.0 199 car 1.0
842 93 34.0 66.0 140.0 56.0 7 130.0 51.0 18.0 120 151.0 251.0 114.0 62.0 5.0 29.0 201.0 207 car 1.0
843 93 39.0 87.0 183.0 64.0 8 169.0 40.0 20.0 134 200.0 422.0 149.0 72.0 7.0 25.0 188.0 195 car 1.0
844 106 54.0 101.0 222.0 67.0 12 222.0 30.0 25.0 173 228.0 721.0 200.0 70.0 3.0 4.0 187.0 201 car 0.0
845 86 36.0 78.0 146.0 58.0 7 135.0 50.0 18.0 124 155.0 270.0 148.0 66.0 0.0 25.0 190.0 195 car 1.0

846 rows × 20 columns

In [25]:
def Corgie_name(col):
    Class=col[0]
    group=col[1]
    if(Class=='van'):
        return 'Cheverolet_van'
    elif(Class=='bus'):
        return 'Double_decker_bus'
    elif(Class=='car'):
        if group==1:
            return 'Saab_9000'
        else:
            return 'Opel_Manta_400'

df['class']=df[['class','group']].apply(Corgie_name,axis=1)
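The row-wise `apply` above works, but the same mapping can be vectorised with `np.select`. A sketch on a toy frame (the values are made up; the column names mirror the notebook's):

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({"class": ["van", "bus", "car", "car"],
                    "group": [0, 0, 1, 0]})

conditions = [
    toy["class"] == "van",
    toy["class"] == "bus",
    (toy["class"] == "car") & (toy["group"] == 1),
]
choices = ["Cheverolet_van", "Double_decker_bus", "Saab_9000"]
# the default covers the remaining case: class == 'car' and group == 0
toy["class"] = np.select(conditions, choices, default="Opel_Manta_400")
print(toy["class"].tolist())
```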
In [26]:
df
Out[26]:
compactness circularity distance_circularity radius_ratio pr.axis_aspect_ratio max.length_aspect_ratio scatter_ratio elongatedness pr.axis_rectangularity max.length_rectangularity scaled_variance scaled_variance.1 scaled_radius_of_gyration scaled_radius_of_gyration.1 skewness_about skewness_about.1 skewness_about.2 hollows_ratio class group
0 95 48.0 83.0 178.0 72.0 10 162.0 42.0 20.0 159 176.0 379.0 184.0 70.0 6.0 16.0 187.0 197 Cheverolet_van 0.0
1 91 41.0 84.0 141.0 57.0 9 149.0 45.0 19.0 143 170.0 330.0 158.0 72.0 9.0 14.0 189.0 199 Cheverolet_van 0.0
2 93 41.0 82.0 159.0 63.0 9 144.0 46.0 19.0 143 160.0 309.0 127.0 63.0 6.0 10.0 199.0 207 Cheverolet_van 0.0
3 85 44.0 70.0 205.0 103.0 52 149.0 45.0 19.0 144 241.0 325.0 188.0 127.0 9.0 11.0 180.0 183 Double_decker_bus 0.0
4 107 NaN 106.0 172.0 50.0 6 255.0 26.0 28.0 169 280.0 957.0 264.0 85.0 5.0 9.0 181.0 183 Double_decker_bus 0.0
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
841 94 46.0 77.0 169.0 60.0 8 158.0 42.0 20.0 148 181.0 373.0 181.0 67.0 12.0 2.0 193.0 199 Saab_9000 1.0
842 93 34.0 66.0 140.0 56.0 7 130.0 51.0 18.0 120 151.0 251.0 114.0 62.0 5.0 29.0 201.0 207 Saab_9000 1.0
843 93 39.0 87.0 183.0 64.0 8 169.0 40.0 20.0 134 200.0 422.0 149.0 72.0 7.0 25.0 188.0 195 Saab_9000 1.0
844 106 54.0 101.0 222.0 67.0 12 222.0 30.0 25.0 173 228.0 721.0 200.0 70.0 3.0 4.0 187.0 201 Opel_Manta_400 0.0
845 86 36.0 78.0 146.0 58.0 7 135.0 50.0 18.0 124 155.0 270.0 148.0 66.0 0.0 25.0 190.0 195 Saab_9000 1.0

846 rows × 20 columns

In [27]:
df['class'].unique()
Out[27]:
array(['Cheverolet_van', 'Double_decker_bus', 'Opel_Manta_400',
       'Saab_9000'], dtype=object)

There are a few missing values in most features, so let's impute them.

In [30]:
df.isnull().sum()
Out[30]:
compactness                    0
circularity                    5
distance_circularity           4
radius_ratio                   6
pr.axis_aspect_ratio           2
max.length_aspect_ratio        0
scatter_ratio                  1
elongatedness                  1
pr.axis_rectangularity         3
max.length_rectangularity      0
scaled_variance                3
scaled_variance.1              2
scaled_radius_of_gyration      2
scaled_radius_of_gyration.1    4
skewness_about                 6
skewness_about.1               1
skewness_about.2               1
hollows_ratio                  0
class                          0
group                          0
dtype: int64
In [31]:
df_var = df.drop(['class', 'group'], axis=1)
df_var = df_var.apply(lambda x: x.fillna(x.median()), axis=0)
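The per-column median fill used here can equivalently be written with scikit-learn's `SimpleImputer`, which is handy when the same imputation must later be applied to unseen data. A sketch on a toy frame:

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

toy = pd.DataFrame({"a": [1.0, np.nan, 3.0],
                    "b": [10.0, 20.0, np.nan]})

imp = SimpleImputer(strategy="median")  # fill each column with its own median
filled = pd.DataFrame(imp.fit_transform(toy), columns=toy.columns)
print(filled)
```

After fitting, `imp.transform` reuses the training medians, so test data is imputed consistently with the training data.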
In [32]:
df_var['class']=df['class']
In [33]:
df_var
Out[33]:
compactness circularity distance_circularity radius_ratio pr.axis_aspect_ratio max.length_aspect_ratio scatter_ratio elongatedness pr.axis_rectangularity max.length_rectangularity scaled_variance scaled_variance.1 scaled_radius_of_gyration scaled_radius_of_gyration.1 skewness_about skewness_about.1 skewness_about.2 hollows_ratio class
0 95 48.0 83.0 178.0 72.0 10 162.0 42.0 20.0 159 176.0 379.0 184.0 70.0 6.0 16.0 187.0 197 Cheverolet_van
1 91 41.0 84.0 141.0 57.0 9 149.0 45.0 19.0 143 170.0 330.0 158.0 72.0 9.0 14.0 189.0 199 Cheverolet_van
2 93 41.0 82.0 159.0 63.0 9 144.0 46.0 19.0 143 160.0 309.0 127.0 63.0 6.0 10.0 199.0 207 Cheverolet_van
3 85 44.0 70.0 205.0 103.0 52 149.0 45.0 19.0 144 241.0 325.0 188.0 127.0 9.0 11.0 180.0 183 Double_decker_bus
4 107 44.0 106.0 172.0 50.0 6 255.0 26.0 28.0 169 280.0 957.0 264.0 85.0 5.0 9.0 181.0 183 Double_decker_bus
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
841 94 46.0 77.0 169.0 60.0 8 158.0 42.0 20.0 148 181.0 373.0 181.0 67.0 12.0 2.0 193.0 199 Saab_9000
842 93 34.0 66.0 140.0 56.0 7 130.0 51.0 18.0 120 151.0 251.0 114.0 62.0 5.0 29.0 201.0 207 Saab_9000
843 93 39.0 87.0 183.0 64.0 8 169.0 40.0 20.0 134 200.0 422.0 149.0 72.0 7.0 25.0 188.0 195 Saab_9000
844 106 54.0 101.0 222.0 67.0 12 222.0 30.0 25.0 173 228.0 721.0 200.0 70.0 3.0 4.0 187.0 201 Opel_Manta_400
845 86 36.0 78.0 146.0 58.0 7 135.0 50.0 18.0 124 155.0 270.0 148.0 66.0 0.0 25.0 190.0 195 Saab_9000

846 rows × 19 columns

In [34]:
sns.pairplot(df_var, diag_kind='kde', hue = 'class') 
Out[34]:
<seaborn.axisgrid.PairGrid at 0x15d6a4a7e48>

Training a plain SVM model on the processed data, without PCA

In [37]:
target = df_var["class"]
features = df_var.drop(["class"], axis=1)
X_train, X_test, y_train, y_test = train_test_split(features,target, test_size = 0.3, random_state = 43)
X_train_z=sc.fit_transform(X_train)
X_test_z=sc.transform(X_test)
svc_model = SVC(C=0.1, kernel='linear')  # gamma is ignored by the linear kernel
svc_model.fit(X_train_z, y_train)

prediction = svc_model.predict(X_test_z)
In [38]:
print(svc_model.score(X_train_z, y_train))
print(svc_model.score(X_test_z, y_test))
0.9611486486486487
0.9173228346456693
In [39]:
from sklearn.metrics import accuracy_score, confusion_matrix
print("Confusion Matrix:\n",confusion_matrix(prediction,y_test))
Confusion Matrix:
 [[47  3  0  6]
 [ 1 59  0  3]
 [ 0  0 76  2]
 [ 3  1  2 51]]

Trying several kernel functions to find which gives the best performance

In [40]:
svc_model  = SVC(kernel='poly')
svc_model.fit(X_train_z, y_train)

prediction = svc_model.predict(X_test_z)
print(svc_model.score(X_train_z, y_train))
print(svc_model.score(X_test_z, y_test))
0.8699324324324325
0.8307086614173228
In [41]:
svc_model  = SVC(kernel='sigmoid')
svc_model.fit(X_train_z, y_train)

prediction = svc_model.predict(X_test_z)
print(svc_model.score(X_train_z, y_train))
print(svc_model.score(X_test_z, y_test))
0.6756756756756757
0.6771653543307087
In [42]:
svc_model  = SVC(kernel='rbf')
svc_model.fit(X_train_z, y_train)

prediction = svc_model.predict(X_test_z)
print(svc_model.score(X_train_z, y_train))
print(svc_model.score(X_test_z, y_test))
0.981418918918919
0.937007874015748

We can infer from these metrics that the RBF kernel gives the best results
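Rather than trying kernels one at a time, the search can be automated with `GridSearchCV`. A minimal sketch on the iris data (the grid values are illustrative, not tuned for this dataset):

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

X, y = load_iris(return_X_y=True)

# cross-validated search over kernel and regularisation strength
param_grid = {"kernel": ["linear", "poly", "rbf", "sigmoid"],
              "C": [0.1, 1, 10]}
grid = GridSearchCV(SVC(), param_grid, cv=5)
grid.fit(X, y)
print(grid.best_params_, round(grid.best_score_, 3))
```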

In [43]:
from sklearn.metrics import accuracy_score, confusion_matrix
print("Confusion Matrix:\n",confusion_matrix(prediction,y_test))
Confusion Matrix:
 [[47  2  0  5]
 [ 0 61  0  1]
 [ 0  0 76  2]
 [ 4  0  2 54]]
In [44]:
from scipy.stats import zscore

df_var_f = df_var.drop('class',axis=1)
df_var_z = df_var_f.apply(zscore)
array = df_var_z.values
array
Out[44]:
array([[ 0.16058035,  0.51807313,  0.05717723, ...,  0.3808703 ,
        -0.31201194,  0.18395733],
       [-0.32546965, -0.62373151,  0.12074088, ...,  0.15679779,
         0.01326483,  0.45297703],
       [-0.08244465, -0.62373151, -0.00638642, ..., -0.29134724,
         1.63964869,  1.52905585],
       ...,
       [-0.08244465, -0.94996141,  0.31143182, ...,  1.38919659,
        -0.14937355, -0.08506238],
       [ 1.49721783,  1.49676282,  1.20132288, ..., -0.96356477,
        -0.31201194,  0.72199673],
       [-0.93303214, -1.43930625, -0.26064101, ...,  1.38919659,
         0.17590322, -0.08506238]])

Using an SVM model on the processed data, with the help of PCA

In [45]:
from sklearn.decomposition import PCA
pca = PCA(10)  
projected = pca.fit_transform(array)
print(projected.shape)
(846, 10)
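Keeping 10 components is a judgement call; the usual justification is the cumulative explained variance from `pca.explained_variance_ratio_`. A sketch on synthetic data showing how to pick the smallest number of components covering, say, 90% of the variance:

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
X_demo = rng.normal(size=(200, 18))  # stand-in for the 18 z-scored features

pca_full = PCA().fit(X_demo)                              # keep all components
cumvar = np.cumsum(pca_full.explained_variance_ratio_)    # running total of variance
n_keep = int(np.argmax(cumvar >= 0.90)) + 1               # smallest k reaching 90%
print(n_keep)
```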
In [46]:
X_train, X_test, y_train, y_test = train_test_split(projected, target, test_size=0.3, random_state=43)
# the features were z-scored before PCA, so no further scaling is needed here
svc_model = SVC(C=0.1, kernel='linear')
svc_model.fit(X_train, y_train)

prediction = svc_model.predict(X_test)
print(svc_model.score(X_train, y_train))
print(svc_model.score(X_test, y_test))
0.9290540540540541
0.9133858267716536
In [47]:
X_train, X_test, y_train, y_test = train_test_split(projected,target, test_size = 0.3, random_state = 43)
svc_model  = SVC(kernel='rbf')
svc_model.fit(X_train, y_train)

prediction = svc_model.predict(X_test)
print(svc_model.score(X_train, y_train))
print(svc_model.score(X_test, y_test))
0.9763513513513513
0.9330708661417323
In [48]:
from sklearn.metrics import accuracy_score, confusion_matrix
print("Confusion Matrix:\n",confusion_matrix(prediction,y_test))
Confusion Matrix:
 [[46  2  0  6]
 [ 0 61  0  0]
 [ 0  0 76  2]
 [ 5  0  2 54]]
In [49]:
from sklearn.ensemble import GradientBoostingClassifier

gbcl = GradientBoostingClassifier(n_estimators = 150, learning_rate = 0.05)
gbcl.fit(X_train, y_train)
print("Training Score")
print(gbcl.score(X_train , y_train))
print("Testing Score")
print(gbcl.score(X_test , y_test))
Training Score
1.0
Testing Score
0.9212598425196851
In [50]:
from sklearn.metrics import accuracy_score, confusion_matrix
# use the gradient-boosting model's own predictions (the earlier `prediction` came from the SVM)
prediction = gbcl.predict(X_test)
print("Confusion Matrix:\n", confusion_matrix(prediction, y_test))